Exploratory Data Analysis¶
- A preliminary step in data analysis, used to:
  - Summarize the main characteristics of the data
  - Gain a better understanding of the data set
  - Uncover relationships between variables
  - Extract important variables
Case Study¶
Reading & Writing Data in Python¶
Exploratory Data Analysis¶
Descriptive Statistics¶
Summarize the categorical data by using value_counts()¶
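As a minimal sketch of this step, assuming a small hypothetical frame with a `drive-wheels` column (the values are illustrative and not taken from the data set), `value_counts()` tallies how often each category occurs:

```python
import pandas as pd

# Hypothetical sample; the notebook itself works on the automobile data set.
df = pd.DataFrame({"drive-wheels": ["rwd", "fwd", "fwd", "4wd", "fwd", "rwd"]})

# Count the rows in each category, most frequent category first.
counts = df["drive-wheels"].value_counts()
print(counts)
```

The result is a Series indexed by category, so a single count can be read with `counts["fwd"]`.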
Group By in Python¶
Correlation¶
Correlation Statistics¶
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
5 rows × 26 columns
C:\Users\SINDH\AppData\Local\Temp\ipykernel_12840\2782822628.py:1: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method. The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy. For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object. df["Price"].replace("?", np.nan, inplace = True)
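The warning above names its own fix: assign the result back to the column instead of calling an inplace method on `df[col]`. A minimal sketch, assuming a hypothetical three-row frame in place of the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the automobile data; "?" marks a missing price.
df = pd.DataFrame({"Price": ["13495", "?", "16500"]})

# Reassign the column instead of using inplace=True on a column selection.
df["Price"] = df["Price"].replace("?", np.nan)

# Convert the remaining strings to numbers so the column becomes float64.
df["Price"] = pd.to_numeric(df["Price"])
print(df["Price"].dtype)
```

Reassignment works on the original frame, which the inplace call on `df["Price"]` will no longer do in pandas 3.0 according to the warning text.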
dtype('float64')
[Figure: x-axis: drive-wheels, y-axis: Price]
[Figure: y-axis: Price]
| | drive-wheels | body-style | Price |
|---|---|---|---|
| 0 | rwd | convertible | 13495.0 |
| 1 | rwd | convertible | 16500.0 |
| 2 | rwd | hatchback | 16500.0 |
| 3 | fwd | sedan | 13950.0 |
| 4 | 4wd | sedan | 17450.0 |
| ... | ... | ... | ... |
| 200 | rwd | sedan | 16845.0 |
| 201 | rwd | sedan | 19045.0 |
| 202 | rwd | sedan | 21485.0 |
| 203 | rwd | sedan | 22470.0 |
| 204 | rwd | sedan | 22625.0 |
205 rows × 3 columns
| | drive-wheels | body-style | Price |
|---|---|---|---|
| 0 | 4wd | hatchback | 7603.000000 |
| 1 | 4wd | sedan | 12647.333333 |
| 2 | 4wd | wagon | 9095.750000 |
| 3 | fwd | convertible | 11595.000000 |
| 4 | fwd | hardtop | 8249.000000 |
| 5 | fwd | hatchback | 8396.387755 |
| 6 | fwd | sedan | 9811.800000 |
| 7 | fwd | wagon | 9997.333333 |
| 8 | rwd | convertible | 23949.600000 |
| 9 | rwd | hardtop | 24202.714286 |
| 10 | rwd | hatchback | 14337.777778 |
| 11 | rwd | sedan | 21711.833333 |
| 12 | rwd | wagon | 16994.222222 |
Mean Price by drive-wheels (rows) and body-style (columns):

| drive-wheels | convertible | hardtop | hatchback | sedan | wagon |
|---|---|---|---|---|---|
| 4wd | NaN | NaN | 7603.000000 | 12647.333333 | 9095.750000 |
| fwd | 11595.0 | 8249.000000 | 8396.387755 | 9811.800000 | 9997.333333 |
| rwd | 23949.6 | 24202.714286 | 14337.777778 | 21711.833333 | 16994.222222 |
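The grouped and pivoted tables above can be produced with `groupby` followed by `pivot`. A minimal sketch, assuming hypothetical rows that reuse the column names from the tables (the prices are made up for illustration):

```python
import pandas as pd

# Hypothetical rows; only the column names match the tables above.
df = pd.DataFrame({
    "drive-wheels": ["rwd", "rwd", "fwd", "fwd", "4wd"],
    "body-style": ["sedan", "sedan", "sedan", "wagon", "sedan"],
    "Price": [20000.0, 22000.0, 9000.0, 10000.0, 12000.0],
})

# Mean price for every (drive-wheels, body-style) combination present.
grouped = df.groupby(["drive-wheels", "body-style"], as_index=False)["Price"].mean()

# Reshape: drive-wheels become rows, body-style values become columns.
pivot = grouped.pivot(index="drive-wheels", columns="body-style", values="Price")
print(pivot)
```

Combinations that never occur in the data, such as 4wd convertibles in the table above, show up as NaN in the pivot.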
[Figure: pseudocolor plot with a colorbar; plt.colorbar() raised a MatplotlibDeprecationWarning about PolyQuadMesh arrays]
[Figure: Correlation between Engine Size and Price]
[Figure: Negative Correlation between highway-mpg and Price]
FutureWarning (same chained-assignment message as above), raised by each of:
df["horsepower"].replace("?", 0, inplace = True)
df["horsepower"].fillna(0, inplace = True)
df["Price"].fillna(0, inplace = True)
0.6912878787942788
1.8175735366187578e-30
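The two numbers above have the shape of a Pearson correlation coefficient followed by its p-value. A minimal sketch of how such a pair can be computed with `scipy.stats.pearsonr`, assuming hypothetical data (the source does not show which columns produced the values above):

```python
import numpy as np
from scipy import stats

# Hypothetical, roughly linear data for illustration only.
engine_size = np.array([90.0, 110.0, 130.0, 150.0, 180.0, 200.0])
price = np.array([7000.0, 9500.0, 13000.0, 15500.0, 21000.0, 24000.0])

# pearsonr returns the correlation coefficient and the two-sided p-value.
coef, p_value = stats.pearsonr(engine_size, price)
print(coef, p_value)
```

A coefficient near 1 or -1 indicates a strong linear relationship, and a very small p-value indicates that the correlation is unlikely to be due to chance.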
Model Development¶
- Linear Regression
- Prediction
- Model Evaluation
- Model Evaluation using Visualization
- Polynomial Regression
What is a Model?¶
Linear Regression¶
Predict car price based on highway-mpg¶
Pre-Processing¶
- Check the highway-mpg and Price columns:
  - Both should be numeric
  - Neither should contain missing data
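These checks can be sketched as follows, assuming a hypothetical three-row frame in which "?" marks a missing price. Dropping the affected rows is one option; it is consistent with the outputs below, which report 4 missing prices and 201 remaining rows out of 205.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; "?" stands in for a missing price.
df = pd.DataFrame({"highway-mpg": [27, 26, 30], "Price": ["13495", "?", "13950"]})

# Turn the "?" placeholder into NaN and make the column numeric.
df["Price"] = pd.to_numeric(df["Price"].replace("?", np.nan))

# Count the missing prices, then drop the rows that lack a price.
n_missing = int(df["Price"].isnull().sum())
df = df.dropna(subset=["Price"])

# Both columns should now be numeric.
both_numeric = all(pd.api.types.is_numeric_dtype(df[c]) for c in ["highway-mpg", "Price"])
print(n_missing, len(df), both_numeric)
```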
0 27
1 27
2 26
3 30
4 22
..
200 28
201 25
202 23
203 27
204 25
Name: highway-mpg, Length: 205, dtype: int64
0
dtype('O')
FutureWarning (same chained-assignment message as above), raised by: df["Price"].replace("?", np.nan, inplace = True)
4
0
| | highway-mpg | Price |
|---|---|---|
| 0 | 27 | 13495.0 |
| 1 | 27 | 16500.0 |
| 2 | 26 | 16500.0 |
| 3 | 30 | 13950.0 |
| 4 | 22 | 17450.0 |
| ... | ... | ... |
| 200 | 28 | 16845.0 |
| 201 | 25 | 19045.0 |
| 202 | 23 | 21485.0 |
| 203 | 27 | 22470.0 |
| 204 | 25 | 22625.0 |
201 rows × 2 columns
Use the scikit-learn Library for Linear Regression¶
0 13495.0
1 16500.0
2 16500.0
3 13950.0
4 17450.0
...
200 16845.0
201 19045.0
202 21485.0
203 22470.0
204 22625.0
Name: Price, Length: 201, dtype: float64
LinearRegression()
c_0 = 38423.3058581574
c_1 = [-821.73337832]
UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names
array([26097.30518333])
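The fitted model is the line Price = c_0 + c_1 × highway-mpg. The prediction above is reproduced by an input of 15 mpg; that input is inferred from the arithmetic and is not shown in the source. The sketch below first checks that arithmetic, then shows the scikit-learn calls on hypothetical data that lie exactly on a made-up line:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Coefficients copied from the output above.
c_0 = 38423.3058581574
c_1 = -821.73337832

# Assumed input of 15 mpg; it reproduces the printed prediction.
manual_prediction = c_0 + c_1 * 15
print(manual_prediction)

# Hypothetical training data on the line Price = 40000 - 800 * mpg.
train = pd.DataFrame({"highway-mpg": [20, 25, 30, 35]})
train["Price"] = 40000.0 - 800.0 * train["highway-mpg"]

lm = LinearRegression()
lm.fit(train[["highway-mpg"]], train["Price"])

# Predicting from a DataFrame with the same column name keeps feature names valid.
new_car = pd.DataFrame({"highway-mpg": [15]})
pred = lm.predict(new_car)
print(lm.intercept_, lm.coef_, pred)
```

Passing a DataFrame rather than a bare list or array to `predict` avoids the feature-name `UserWarning` shown above.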
Multi Variable Linear Regression¶
- Predict the price of a car based on horsepower, curb-weight, engine-size, and highway-mpg
- Make sure to check the types of all columns
horsepower     object
curb-weight     int64
engine-size     int64
highway-mpg     int64
dtype: object
FutureWarning (same chained-assignment message as above), raised by: df["horsepower"].replace("?", np.nan, inplace = True)
2
FutureWarning (same chained-assignment message as above), raised by: df["horsepower"].replace(np.nan, ave_hp, inplace = True)
0
0 0 0 0
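The outputs above show two missing horsepower values being replaced by the column average (`ave_hp`). A minimal sketch of that mean imputation, assuming hypothetical values and using reassignment so that no chained-assignment warning is raised:

```python
import numpy as np
import pandas as pd

# Hypothetical values; "?" marks a missing horsepower entry.
df = pd.DataFrame({"horsepower": ["111", "?", "154", "102"]})

# Convert "?" to NaN and make the column numeric.
df["horsepower"] = pd.to_numeric(df["horsepower"].replace("?", np.nan))

# mean() skips NaN by default, so this is the average of the known values.
ave_hp = df["horsepower"].mean()

# Fill the gaps with the average and assign the result back.
df["horsepower"] = df["horsepower"].fillna(ave_hp)
print(df["horsepower"].isnull().sum())
```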
LinearRegression()
c_0 = -15824.038208234477
c_{1-4} = [53.61042729 4.70886444 81.47225667 36.39637823]
UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names
array([87642.15866469])
Model Evaluation¶
- Mean Squared Error (MSE)
- R-squared (coefficient of determination)
11976801.681229591
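MSE is the mean of the squared differences between actual and predicted values, and R-squared is 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch using scikit-learn's metric functions on hypothetical values small enough to verify by hand:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

# MSE: (0.25 + 0 + 0.25 + 0) / 4 = 0.125
mse = mean_squared_error(y_true, y_pred)

# R-squared: 1 - SS_res / SS_tot = 1 - 0.5 / 20 = 0.975
r2 = r2_score(y_true, y_pred)
print(mse, r2)
```

A lower MSE and an R-squared closer to 1 both indicate a better fit; MSE is expressed in squared units of the target, which is why the value above is so large for car prices.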
Model Evaluation using Visualization¶
If the residuals are randomly scattered around the x-axis (zero), a linear model is appropriate; a visible pattern in the residuals suggests a non-linear relationship.
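A residual is the actual value minus the value predicted by the fitted line. A minimal sketch of computing residuals, assuming hypothetical data and using `numpy.polyfit` as a stand-in for the regression model:

```python
import numpy as np

# Hypothetical, nearly linear data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.0])

# Fit a straight line by least squares (degree-1 polynomial).
slope, intercept = np.polyfit(x, y, 1)

# Residual = actual value minus fitted value.
residuals = y - (slope * x + intercept)
print(residuals)
```

Plotting `residuals` against `x` gives the residual plot; for a least-squares line with an intercept the residuals always sum to zero, so the shape of the scatter carries the information, not its average.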
Problem Statement¶
The following features are available for California houses in a specific locality, obtained from 1990 census data:
- Longitude
- Latitude
- Housing Median Age
- Total Rooms
- Total Bedrooms
- Population
- Households
- Median Income
- Median House Value
- Ocean Proximity
Create clusters/groups of houses based on a selected set of features.
Acknowledgement / Source¶
Importing Libraries¶
Loading the Dataset¶
Visualize the Data¶
Pre-Processing¶
Model¶
Silhouette (si·loo·et) Score¶
- Scores closer to 1: Indicate well-separated clusters, suggesting the clustering is likely effective in capturing the underlying structure in the data.
- Scores around 0: Indicate clusters with some overlap, and you might consider adjusting the number of clusters or the clustering algorithm to see if you can achieve better separation.
- Negative scores: Suggest that some data points are potentially assigned to the wrong cluster, and you might need to explore alternative clustering strategies.
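The score can be computed with scikit-learn's `silhouette_score`. A minimal sketch, assuming three hypothetical, well-separated point clouds in place of the housing data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data: three well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])

# Cluster the points, then score how well separated the clusters are.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
score = silhouette_score(X, km.labels_)
print(score)
```

Because the blobs barely overlap, the score lands close to 1; real data usually scores lower.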
Choosing the Number of Clusters¶
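One common way to choose the number of clusters is to fit K-Means for several values of k and keep the k with the highest silhouette score. A minimal sketch on hypothetical data with three blobs; the candidate range 2 to 6 is an arbitrary choice for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data: three well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])

# Score every candidate number of clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Keep the k with the highest silhouette score.
best_k = max(scores, key=scores.get)
print(scores, best_k)
```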
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
| | longitude | latitude | median_house_value |
|---|---|---|---|
| 0 | -122.23 | 37.88 | 452600.0 |
| 1 | -122.22 | 37.86 | 358500.0 |
| 2 | -122.24 | 37.85 | 352100.0 |
| 3 | -122.25 | 37.85 | 341300.0 |
| 4 | -122.25 | 37.85 | 342200.0 |
(20640, 3)
[Figure: x-axis: longitude, y-axis: latitude]
KMeans(n_clusters=3, random_state=0)
[Figure: x-axis: longitude, y-axis: latitude]
[Figure: y-axis: median_house_value]
0.7499115323584772
[Figure: x-axis: longitude, y-axis: latitude]
[Figure: x-axis: longitude, y-axis: latitude]
[Figure: x-axis: longitude, y-axis: latitude]
[Figure: no axis labels]
[Figure: x-axis: longitude, y-axis: latitude]
[Figure: y-axis: median_house_value]
| | housing_median_age | total_rooms | total_bedrooms | population |
|---|---|---|---|---|
| count | 20640.000000 | 20640.000000 | 20433.000000 | 20640.000000 |
| mean | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 |
| std | 12.585558 | 2181.615252 | 421.385070 | 1132.462122 |
| min | 1.000000 | 2.000000 | 1.000000 | 3.000000 |
| 25% | 18.000000 | 1447.750000 | 296.000000 | 787.000000 |
| 50% | 29.000000 | 2127.000000 | 435.000000 | 1166.000000 |
| 75% | 37.000000 | 3148.000000 | 647.000000 | 1725.000000 |
| max | 52.000000 | 39320.000000 | 6445.000000 | 35682.000000 |
| | housing_median_age | total_rooms | total_bedrooms | population |
|---|---|---|---|---|
| 0 | 41.0 | 880.0 | 129.0 | 322.0 |
| 1 | 21.0 | 7099.0 | 1106.0 | 2401.0 |
| 2 | 52.0 | 1467.0 | 190.0 | 496.0 |
| 3 | 52.0 | 1274.0 | 235.0 | 558.0 |
| 4 | 52.0 | 1627.0 | 280.0 | 565.0 |
| 5 | 52.0 | 919.0 | 213.0 | 413.0 |
| 6 | 52.0 | 2535.0 | 489.0 | 1094.0 |
| 7 | 52.0 | 3104.0 | 687.0 | 1157.0 |
| 8 | 42.0 | 2555.0 | 665.0 | 1206.0 |
| 9 | 52.0 | 3549.0 | 707.0 | 1551.0 |
FutureWarning (same chained-assignment message as above) and SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Raised by: selected_features["housing_median_age"].replace("?", np.nan, inplace = True)
0
FutureWarning and SettingWithCopyWarning (same messages as above), raised by: selected_features["total_rooms"].replace("?", np.nan, inplace = True)
0
FutureWarning and SettingWithCopyWarning (same messages as above), raised by: selected_features["total_bedrooms"].replace("?", np.nan, inplace= True)
207
FutureWarning and SettingWithCopyWarning (same messages as above), raised by: selected_features["population"].replace("?", np.nan, inplace = True)
0
FutureWarning and SettingWithCopyWarning (same messages as above), raised by: selected_features["total_bedrooms"].replace(np.nan, selected_features_mean, inplace= True)
0
population
3.0 6.00
5.0 3.00
6.0 2.00
8.0 3.25
9.0 7.00
...
15507.0 5290.00
16122.0 5471.00
16305.0 6210.00
28566.0 6445.00
35682.0 4819.00
Name: total_bedrooms, Length: 3888, dtype: float64
[[0.04330435 0.92945912 0.13625026 0.34009754]
[0.00277219 0.93713188 0.14600195 0.3169536 ]
[0.03331069 0.93974594 0.12171215 0.31773278]
...
[0.00675685 0.89587898 0.19276899 0.40024407]
[0.008808 0.91016017 0.20013737 0.36259607]
[0.00504461 0.87807691 0.19421737 0.43730437]]
array([[ 0.63157895, -0.73342156, -0.89241877, -0.89978678],
[-0.42105263, 2.92427584, 1.92924188, 1.31663113],
[ 1.21052632, -0.38817821, -0.71624549, -0.71428571],
...,
[-0.63157895, 0.0746949 , 0.13574007, -0.16950959],
[-0.57894737, -0.15703573, -0.08375451, -0.45309168],
[-0.68421053, 0.38700191, 0.51407942, 0.23560768]])
array([[ 0.39606755, -0.45993376, -0.55964202, -0.56426255],
[-0.11179811, 0.77645525, 0.51225331, 0.34959258],
[ 0.74513076, -0.2389403 , -0.44087975, -0.43967343],
...,
[-0.93980159, 0.11114744, 0.20198383, -0.25223353],
[-0.76539427, -0.20760825, -0.11072721, -0.59900744],
[-0.70657304, 0.39965056, 0.53088143, 0.2433082 ]])
KMeans(n_clusters=4, random_state=0)
| | PC1 | PC2 |
|---|---|---|
| 0 | 1.009226 | -0.073469 |
| 1 | -0.923991 | -0.085173 |
| 2 | 0.846904 | -0.475885 |
| 3 | 0.844834 | -0.508459 |
| 4 | 0.769151 | -0.577098 |
[Figure: x-axis: PC1, y-axis: PC2]
[Figure: x-axis: PC1, y-axis: PC2]
[Figure: y-axis: PC1]
0.5273664489203738
Logistic Regression¶
- The notebook implements a logistic regression model to classify bank notes as 'authentic' or 'fake'
- We use a data set with the following features:
  - Variance of Wavelet Transformed image (continuous)
  - Skewness of Wavelet Transformed image (continuous)
  - Kurtosis of Wavelet Transformed image (continuous)
  - Entropy of image (continuous)
  - Class (integer)
- Total instances: 1372
- Data Source
Data Splitting¶
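A minimal sketch of splitting data and fitting a logistic regression with scikit-learn. The data here are hypothetical (two synthetic classes with four continuous features, mirroring the shape of the bank note data), and the 80/20 split ratio is an assumption, not a value taken from the notebook:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples per class, four continuous features.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(-2.0, 1.0, size=(100, 4)),
    rng.normal(2.0, 1.0, size=(100, 4)),
])
y = np.array([0] * 100 + [1] * 100)

# Hold out 20% for testing; stratify keeps the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fit on the training part only, then measure accuracy on unseen data.
clf = LogisticRegression().fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(X_train.shape, X_test.shape, accuracy)
```

Evaluating on the held-out part gives an estimate of how the model behaves on data it has not seen during training.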
Evaluation Measures¶
Confusion Matrix¶
Image Source: https://www.evidentlyai.com/classification-metrics/confusion-matrix
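A confusion matrix counts true negatives, false positives, false negatives and true positives, from which accuracy, precision and recall follow. A minimal sketch with scikit-learn, assuming hypothetical labels where 1 is treated as the positive class (which bank note class is positive in the notebook is not shown in the source):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels; 1 is the assumed positive class.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted positives that are correct
recall = tp / (tp + fn)           # share of actual positives that are found
print(cm, accuracy, precision, recall)
```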
Visualizing Confusion Matrix¶
| | Area | Bedrooms | Bathrooms | Stories | Mainroad | Guestroom | Basement | Hotwaterheating | Airconditioning | Parking | Furnishingstatus | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7420 | 4 | 2 | 3 | yes | no | no | no | yes | 2 | furnished | 13300000 |
| 1 | 8960 | ? | 4 | 4 | yes | no | no | no | yes | 3 | furnished | 12250000 |
| 2 | ? | 3 | 2 | 2 | yes | no | yes | no | no | 2 | semi-furnished | 12250000 |
| 3 | 7500 | 4 | 2 | 2 | yes | no | yes | no | yes | 3 | furnished | 12215000 |
| 4 | 7420 | 4 | 1 | 2 | yes | yes | yes | no | yes | 2 | furnished | 11410000 |
| 5 | 7500 | 3 | 3 | 1 | yes | no | yes | no | yes | 2 | semi-furnished | 10850000 |
| 6 | 8580 | 4 | 3 | 4 | yes | no | no | no | yes | 2 | semi-furnished | 10150000 |
| 7 | 16200 | 5 | 3 | ? | yes | no | no | no | no | 0 | unfurnished | 10150000 |
| 8 | 8100 | 4 | 1 | 2 | yes | yes | yes | no | yes | 2 | furnished | 9870000 |
| 9 | 5750 | 3 | 2 | 4 | yes | yes | no | no | yes | 1 | unfurnished | 9800000 |
| | Area | Bedrooms | Bathrooms | Stories | Mainroad | Parking | Furnishingstatus | Price |
|---|---|---|---|---|---|---|---|---|
| 0 | 7420 | 4 | 2 | 3 | yes | 2 | furnished | 13300000 |
| 1 | 8960 | ? | 4 | 4 | yes | 3 | furnished | 12250000 |
| 2 | ? | 3 | 2 | 2 | yes | 2 | semi-furnished | 12250000 |
| 3 | 7500 | 4 | 2 | 2 | yes | 3 | furnished | 12215000 |
| 4 | 7420 | 4 | 1 | 2 | yes | 2 | furnished | 11410000 |
| 5 | 7500 | 3 | 3 | 1 | yes | 2 | semi-furnished | 10850000 |
| 6 | 8580 | 4 | 3 | 4 | yes | 2 | semi-furnished | 10150000 |
| 7 | 16200 | 5 | 3 | ? | yes | 0 | unfurnished | 10150000 |
| 8 | 8100 | 4 | 1 | 2 | yes | 2 | furnished | 9870000 |
| 9 | 5750 | 3 | 2 | 4 | yes | 1 | unfurnished | 9800000 |
0
Area                0
Bedrooms            0
Bathrooms           0
Stories             0
Mainroad            0
Parking             0
Furnishingstatus    0
Price               0
dtype: int64
C:\Users\SINDH\AppData\Local\Temp\ipykernel_10144\1078061793.py:1: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method. The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy. For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object. df["Area"].replace("?", np.nan, inplace = True)
9
C:\Users\SINDH\AppData\Local\Temp\ipykernel_10144\2500254421.py:1: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method. The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy. For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object. df["Bedrooms"].replace("?", np.nan, inplace = True)
dtype('float64')7
C:\Users\SINDH\AppData\Local\Temp\ipykernel_10144\2709543407.py:1: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method. The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy. For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object. df["Bathrooms"].replace("?", np.nan, inplace = True)
5
C:\Users\SINDH\AppData\Local\Temp\ipykernel_10144\1854985383.py:1: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method. The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy. For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object. df["Stories"].replace("?", np.nan, inplace = True)
7
C:\Users\SINDH\AppData\Local\Temp\ipykernel_10144\959753586.py:1: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method. The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy. For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object. df["Mainroad"].replace("?", np.nan, inplace = True)
0
C:\Users\SINDH\AppData\Local\Temp\ipykernel_10144\1665768048.py:1: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method. The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy. For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object. df["Parking"].replace("?", np.nan, inplace = True)
6
C:\Users\SINDH\AppData\Local\Temp\ipykernel_10144\2512529072.py:1: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method. The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy. For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object. df["Furnishingstatus"].replace("?", np.nan, inplace = True)
0
C:\Users\SINDH\AppData\Local\Temp\ipykernel_10144\2782822628.py:1: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method. The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy. For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object. df["Price"].replace("?", np.nan, inplace = True)
3
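The FutureWarning messages above recommend assigning the result back to the column rather than calling an inplace method on `df[col]`. A minimal sketch of that pattern follows, assuming a small hypothetical frame (the values are illustrative, not the housing data):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the housing table; values are illustrative only.
df = pd.DataFrame({
    "Parking": ["2", "?", "0", "3"],
    "Price": ["13300000", "12250000", "?", "12215000"],
})

# Assign the result back to the column instead of calling
# df[col].replace(..., inplace=True), which acts on an intermediate object.
for col in ["Parking", "Price"]:
    df[col] = df[col].replace("?", np.nan)

# Count the missing values per column after the replacement.
missing = df.isna().sum()
print(missing)
```

Assigning back keeps the operation on the original frame and should not trigger the chained-assignment warning.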
0 13300000.0
1 12250000.0
2 12250000.0
3 12215000.0
4 11410000.0
...
540 1820000.0
541 1767150.0
542 1750000.0
543 1750000.0
544 1750000.0
Name: Price, Length: 542, dtype: float64
| | Area | Bedrooms | Bathrooms | Stories | Parking | Price |
|---|---|---|---|---|---|---|
| count | 533.000000 | 535.000000 | 537.000000 | 535.000000 | 536.000000 | 5.420000e+02 |
| mean | 5155.748593 | 2.971963 | 1.286778 | 1.809346 | 0.690299 | 4.767167e+06 |
| std | 2167.206723 | 0.737669 | 0.503414 | 0.872546 | 0.860981 | 1.875564e+06 |
| min | 1650.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 1.750000e+06 |
| 25% | 3600.000000 | 3.000000 | 1.000000 | 1.000000 | 0.000000 | 3.430000e+06 |
| 50% | 4600.000000 | 3.000000 | 1.000000 | 2.000000 | 0.000000 | 4.340000e+06 |
| 75% | 6360.000000 | 3.000000 | 2.000000 | 2.000000 | 1.000000 | 5.766250e+06 |
| max | 16200.000000 | 6.000000 | 4.000000 | 4.000000 | 3.000000 | 1.330000e+07 |
0 7420.0
1 8960.0
2 NaN
3 7500.0
4 7420.0
...
540 3000.0
541 2400.0
542 3620.0
543 2910.0
544 3850.0
Name: Area, Length: 542, dtype: float64
| | Area | Bedrooms | Bathrooms | Stories | Mainroad | Parking | Furnishingstatus | Price |
|---|---|---|---|---|---|---|---|---|
| 0 | 7420.000000 | 4.0 | 2.0 | 3.0 | yes | 2.0 | furnished | 13300000.0 |
| 1 | 8960.000000 | NaN | 4.0 | 4.0 | yes | 3.0 | furnished | 12250000.0 |
| 2 | 5155.748593 | 3.0 | 2.0 | 2.0 | yes | 2.0 | semi-furnished | 12250000.0 |
| 3 | 7500.000000 | 4.0 | 2.0 | 2.0 | yes | 3.0 | furnished | 12215000.0 |
| 4 | 7420.000000 | 4.0 | 1.0 | 2.0 | yes | 2.0 | furnished | 11410000.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 75 | 4260.000000 | 4.0 | 2.0 | 2.0 | yes | 0.0 | semi-furnished | 6650000.0 |
| 76 | 5155.748593 | 3.0 | 2.0 | 3.0 | yes | 0.0 | furnished | 6650000.0 |
| 77 | 6500.000000 | 3.0 | 2.0 | 3.0 | yes | 0.0 | furnished | 6650000.0 |
| 78 | 5700.000000 | 3.0 | 1.0 | 1.0 | yes | 2.0 | furnished | 6650000.0 |
| 79 | 6000.000000 | 3.0 | 2.0 | 3.0 | yes | 0.0 | furnished | 6650000.0 |
80 rows × 8 columns
| | Area | Bedrooms |
|---|---|---|
| 0 | 7420.000000 | 4.0 |
| 1 | 8960.000000 | NaN |
| 2 | 5155.748593 | 3.0 |
| 3 | 7500.000000 | 4.0 |
| 4 | 7420.000000 | 4.0 |
| ... | ... | ... |
| 540 | 3000.000000 | 2.0 |
| 541 | 2400.000000 | 3.0 |
| 542 | 3620.000000 | 2.0 |
| 543 | 2910.000000 | 3.0 |
| 544 | 3850.000000 | 3.0 |
542 rows × 2 columns
| | Area | Bedrooms | Bathrooms | Stories | Mainroad | Parking | Furnishingstatus | Price |
|---|---|---|---|---|---|---|---|---|
| 0 | 7420.000000 | 4.0 | 2.0 | 3.0 | yes | 2.0 | furnished | 13300000.0 |
| 1 | 8960.000000 | NaN | 4.0 | 4.0 | yes | 3.0 | furnished | 12250000.0 |
| 2 | 5155.748593 | 3.0 | 2.0 | 2.0 | yes | 2.0 | semi-furnished | 12250000.0 |
| 3 | 7500.000000 | 4.0 | 2.0 | 2.0 | yes | 3.0 | furnished | 12215000.0 |
| 4 | 7420.000000 | 4.0 | 1.0 | 2.0 | yes | 2.0 | furnished | 11410000.0 |
| 5 | 7500.000000 | 3.0 | 3.0 | 1.0 | yes | 2.0 | semi-furnished | 10850000.0 |
| 6 | 8580.000000 | 4.0 | 3.0 | 4.0 | yes | 2.0 | semi-furnished | 10150000.0 |
| 7 | 16200.000000 | 5.0 | 3.0 | NaN | yes | 0.0 | unfurnished | 10150000.0 |
| 8 | 8100.000000 | 4.0 | 1.0 | 2.0 | yes | 2.0 | furnished | 9870000.0 |
| 9 | 5750.000000 | 3.0 | 2.0 | 4.0 | yes | 1.0 | unfurnished | 9800000.0 |
| 10 | 13200.000000 | 3.0 | 1.0 | 2.0 | yes | 2.0 | furnished | 9800000.0 |
| 11 | 6000.000000 | 4.0 | 3.0 | 2.0 | yes | 2.0 | semi-furnished | 9681000.0 |
| 12 | 6550.000000 | 4.0 | 2.0 | 2.0 | yes | 1.0 | semi-furnished | 9310000.0 |
| 13 | 3500.000000 | 4.0 | 2.0 | 2.0 | yes | 2.0 | furnished | 9240000.0 |
| 14 | 7800.000000 | 3.0 | 2.0 | 2.0 | yes | 0.0 | semi-furnished | 9240000.0 |
| 15 | 6000.000000 | 4.0 | 1.0 | 2.0 | yes | 2.0 | semi-furnished | 9100000.0 |
| 16 | 6600.000000 | 4.0 | 2.0 | 2.0 | yes | 1.0 | unfurnished | 9100000.0 |
| 17 | 8500.000000 | 3.0 | 2.0 | 4.0 | yes | 2.0 | furnished | 8960000.0 |
| 18 | 5155.748593 | 3.0 | 2.0 | 2.0 | yes | 2.0 | furnished | 8890000.0 |
| 19 | 6420.000000 | 3.0 | 2.0 | 2.0 | yes | 1.0 | semi-furnished | 8855000.0 |
| 20 | 4320.000000 | 3.0 | 1.0 | 2.0 | yes | 2.0 | semi-furnished | 8750000.0 |
| 21 | 7155.000000 | 3.0 | 2.0 | 1.0 | yes | 2.0 | unfurnished | 8680000.0 |
| 22 | 8050.000000 | 3.0 | 1.0 | 1.0 | yes | 1.0 | furnished | 8645000.0 |
| 23 | 4560.000000 | 3.0 | 2.0 | NaN | yes | 1.0 | furnished | 8645000.0 |
| 24 | 8800.000000 | 3.0 | 2.0 | 2.0 | yes | 2.0 | furnished | 8575000.0 |
| 25 | 6540.000000 | 4.0 | 2.0 | 2.0 | yes | 2.0 | furnished | 8540000.0 |
| 26 | 6000.000000 | 3.0 | 2.0 | 4.0 | yes | 0.0 | semi-furnished | 8463000.0 |
| 27 | 8875.000000 | NaN | 1.0 | 1.0 | yes | 1.0 | semi-furnished | 8400000.0 |
| 28 | 7950.000000 | 5.0 | 2.0 | 2.0 | yes | 2.0 | unfurnished | 8400000.0 |
| 29 | 5500.000000 | 4.0 | 2.0 | 2.0 | yes | 1.0 | semi-furnished | 8400000.0 |
| 30 | 7475.000000 | 3.0 | 2.0 | 4.0 | yes | 2.0 | unfurnished | 8400000.0 |
| 31 | 7000.000000 | 3.0 | 1.0 | 4.0 | yes | 2.0 | semi-furnished | 8400000.0 |
| 32 | 4880.000000 | 4.0 | 2.0 | 2.0 | yes | NaN | furnished | 8295000.0 |
| 33 | 5960.000000 | 3.0 | 3.0 | 2.0 | yes | 1.0 | unfurnished | 8190000.0 |
| 34 | 6840.000000 | 5.0 | 1.0 | 2.0 | yes | 1.0 | furnished | 8120000.0 |
| 35 | 7000.000000 | 3.0 | 2.0 | 4.0 | yes | 2.0 | furnished | 8080940.0 |
| 36 | 7482.000000 | 3.0 | 2.0 | 3.0 | yes | 1.0 | furnished | 8043000.0 |
| 37 | 9000.000000 | 4.0 | 2.0 | 4.0 | yes | 2.0 | furnished | 7980000.0 |
| 38 | 6000.000000 | 3.0 | 1.0 | 4.0 | yes | 2.0 | unfurnished | 7962500.0 |
| 39 | 6000.000000 | 4.0 | NaN | 4.0 | yes | 1.0 | semi-furnished | 7910000.0 |
3
C:\Users\SINDH\AppData\Local\Temp\ipykernel_10144\2436856013.py:1: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method. The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy. For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object. df["Bedrooms"].replace(np.nan, mean_bedrooms, inplace=True)
| | Bedrooms | Area |
|---|---|---|
| 0 | 4.0 | 7420.000000 |
| 1 | 3.0 | 8960.000000 |
| 2 | 3.0 | 5155.748593 |
| 3 | 4.0 | 7500.000000 |
| 4 | 4.0 | 7420.000000 |
| ... | ... | ... |
| 540 | 2.0 | 3000.000000 |
| 541 | 3.0 | 2400.000000 |
| 542 | 2.0 | 3620.000000 |
| 543 | 3.0 | 2910.000000 |
| 544 | 3.0 | 3850.000000 |
542 rows × 2 columns
2
C:\Users\SINDH\AppData\Local\Temp\ipykernel_10144\3660585759.py:1: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method. The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy. For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object. df["Bathrooms"].replace(np.nan, mean_bathrooms, inplace=True)
2
C:\Users\SINDH\AppData\Local\Temp\ipykernel_10144\1488270682.py:1: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method. The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy. For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object. df["Stories"].replace(np.nan, mean_stories, inplace = True)
1
C:\Users\SINDH\AppData\Local\Temp\ipykernel_10144\243805928.py:1: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method. The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy. For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object. df["Parking"].replace(np.nan, mean_parking, inplace=True)
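The warnings above come from filling missing values with a per-column statistic. One way to do this without chained assignment is `DataFrame.fillna` with a Series of column means. This is a sketch on hypothetical values; the exact fill values used in the notebook are not reproduced here:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns with gaps; not the real housing data.
df = pd.DataFrame({
    "Bedrooms": [4.0, np.nan, 3.0, 4.0],
    "Parking": [2.0, 3.0, np.nan, 3.0],
})

# mean() skips NaN by default, so each mean is taken over observed values only.
means = df[["Bedrooms", "Parking"]].mean()

# fillna with a Series fills each column using the entry with the same label.
df = df.fillna(means)
print(df)
```

For count-like columns such as bedrooms, the median or mode may be a more natural fill value than the mean; that choice is left open here.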
['yes' 'no']
['furnished' 'semi-furnished' 'unfurnished']
Mainroad
yes 465
no 77
Name: count, dtype: int64
Furnishingstatus
semi-furnished 226
unfurnished 178
furnished 138
Name: count, dtype: int64
Area Bedrooms Bathrooms Stories Parking Price \
0 7420.000000 4.0 2.0 3.0 2.0 13300000.0
1 8960.000000 3.0 4.0 4.0 3.0 12250000.0
2 5155.748593 3.0 2.0 2.0 2.0 12250000.0
3 7500.000000 4.0 2.0 2.0 3.0 12215000.0
4 7420.000000 4.0 1.0 2.0 2.0 11410000.0
.. ... ... ... ... ... ...
540 3000.000000 2.0 1.0 1.0 2.0 1820000.0
541 2400.000000 3.0 1.0 1.0 0.0 1767150.0
542 3620.000000 2.0 1.0 1.0 0.0 1750000.0
543 2910.000000 3.0 1.0 1.0 0.0 1750000.0
544 3850.000000 3.0 1.0 2.0 0.0 1750000.0
Mainroad_no Mainroad_yes Furnishingstatus_furnished \
0 False True True
1 False True True
2 False True False
3 False True True
4 False True True
.. ... ... ...
540 False True False
541 True False False
542 False True False
543 True False True
544 False True False
Furnishingstatus_semi-furnished Furnishingstatus_unfurnished
0 False False
1 False False
2 True False
3 False False
4 False False
.. ... ...
540 False True
541 True False
542 False True
543 False False
544 False True
[542 rows x 11 columns]
| | Area | Bedrooms | Bathrooms | Stories | Parking | Price | Mainroad_no | Mainroad_yes | Furnishingstatus_furnished | Furnishingstatus_semi-furnished | Furnishingstatus_unfurnished |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7420.000000 | 4.0 | 2.0 | 3.0 | 2.0 | 13300000.0 | False | True | True | False | False |
| 1 | 8960.000000 | 3.0 | 4.0 | 4.0 | 3.0 | 12250000.0 | False | True | True | False | False |
| 2 | 5155.748593 | 3.0 | 2.0 | 2.0 | 2.0 | 12250000.0 | False | True | False | True | False |
| 3 | 7500.000000 | 4.0 | 2.0 | 2.0 | 3.0 | 12215000.0 | False | True | True | False | False |
| 4 | 7420.000000 | 4.0 | 1.0 | 2.0 | 2.0 | 11410000.0 | False | True | True | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 540 | 3000.000000 | 2.0 | 1.0 | 1.0 | 2.0 | 1820000.0 | False | True | False | False | True |
| 541 | 2400.000000 | 3.0 | 1.0 | 1.0 | 0.0 | 1767150.0 | True | False | False | True | False |
| 542 | 3620.000000 | 2.0 | 1.0 | 1.0 | 0.0 | 1750000.0 | False | True | False | False | True |
| 543 | 2910.000000 | 3.0 | 1.0 | 1.0 | 0.0 | 1750000.0 | True | False | True | False | False |
| 544 | 3850.000000 | 3.0 | 1.0 | 2.0 | 0.0 | 1750000.0 | False | True | False | False | True |
542 rows × 11 columns
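The indicator columns in the table above (`Mainroad_no`, `Mainroad_yes`, `Furnishingstatus_*`) have the shape that `pandas.get_dummies` produces. A minimal sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical three-row frame using the same column names as the table above.
df = pd.DataFrame({
    "Area": [7420.0, 2400.0, 3620.0],
    "Mainroad": ["yes", "no", "yes"],
    "Furnishingstatus": ["furnished", "semi-furnished", "unfurnished"],
})

# One indicator column is created per category, named <column>_<category>.
encoded = pd.get_dummies(df, columns=["Mainroad", "Furnishingstatus"])
print(encoded.columns.tolist())
```

`Mainroad_no` and `Mainroad_yes` always sum to one, so they carry the same information. Passing `drop_first=True` to `get_dummies` drops one indicator per category group, which avoids that redundancy in a linear model.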
0 True
1 True
2 False
3 True
4 True
...
540 False
541 False
542 False
543 True
544 False
Name: Furnishingstatus_furnished, Length: 542, dtype: bool
0 0 0 0 0 0 0 0 0 0 0
LinearRegression()
c_0 = -55069.20985339861
c_{1-4} = [ 2.76303444e+02 2.02463933e+05 1.12200914e+06 4.96912078e+05
3.47723155e+05 -3.03209898e+05 3.03209898e+05 2.70695767e+05
6.91308081e+04 -3.39826575e+05]
C:\Users\SINDH\AppData\Roaming\Python\Python312\site-packages\sklearn\base.py:493: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names warnings.warn(
array([7809064.56773169])
1472162563248.0195
(0.0, 13877500.0)
<Axes: xlabel='Bedrooms', ylabel='Price'>
<Axes: xlabel='Area', ylabel='Price'>
| | Patient ID | Age | BMI | Diagnosis | Blood Pressure |
|---|---|---|---|---|---|
| 0 | P0001 | 64 | 23.416480 | NaN | 119/66 |
| 1 | P0002 | 49 | 30.539825 | NaN | 103/62 |
| 2 | P0003 | 68 | 31.654859 | Hypertension | 98/70 |
| 3 | P0004 | 22 | NaN | Hypertension | 117/87 |
| 4 | P0005 | 42 | NaN | Hypertension | NaN |
Patient ID object
Age int64
BMI float64
Diagnosis object
Blood Pressure object
dtype: object
| | Patient ID | Age | BMI | Diagnosis | Blood Pressure |
|---|---|---|---|---|---|
| 0 | False | False | False | True | False |
| 1 | False | False | False | True | False |
| 2 | False | False | False | False | False |
| 3 | False | False | True | False | False |
| 4 | False | False | True | False | True |
| 5 | False | False | True | False | False |
| 6 | False | False | False | False | False |
| 7 | False | False | False | True | False |
| 8 | False | False | False | True | False |
| 9 | False | False | False | True | True |
| 10 | False | False | True | False | False |
| 11 | False | False | True | False | False |
| 12 | False | False | True | False | True |
| 13 | False | False | False | False | False |
| 14 | False | False | True | False | False |
| 15 | False | False | True | True | True |
| 16 | False | False | True | False | False |
| 17 | False | False | False | False | True |
| 18 | False | False | False | False | False |
| 19 | False | False | False | False | True |
| 20 | False | False | False | False | False |
| 21 | False | False | True | False | True |
| 22 | False | False | False | True | False |
| 23 | False | False | True | True | False |
| 24 | False | False | True | False | False |
| 25 | False | False | True | False | False |
| 26 | False | False | False | False | False |
| 27 | False | False | False | True | True |
| 28 | False | False | False | False | True |
| 29 | False | False | True | False | False |
Patient ID 0
Age 0
BMI 14
Diagnosis 9
Blood Pressure 9
dtype: int64
C:\Users\SINDH\AppData\Local\Temp\ipykernel_10184\4099862312.py:1: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method. The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy. For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object. df["BMI"].replace(np.nan, mean, inplace = True)
| | Patient ID | Age | BMI | Diagnosis | Blood Pressure |
|---|---|---|---|---|---|
| 0 | P0001 | 64 | 23.416480 | NaN | 119/66 |
| 1 | P0002 | 49 | 30.539825 | NaN | 103/62 |
| 2 | P0003 | 68 | 31.654859 | Hypertension | 98/70 |
| 3 | P0004 | 22 | 27.203652 | Hypertension | 117/87 |
| 4 | P0005 | 42 | 27.203652 | Hypertension | NaN |
| 5 | P0006 | 29 | 27.203652 | Hypertension | 139/81 |
| 6 | P0007 | 21 | 20.154990 | Diabetes | 115/60 |
| 7 | P0008 | 51 | 21.675982 | NaN | 137/65 |
| 8 | P0009 | 42 | 31.285261 | NaN | 123/71 |
| 9 | P0010 | 35 | 33.454012 | NaN | NaN |
| 10 | P0011 | 62 | 27.203652 | Diabetes | 111/84 |
| 11 | P0012 | 23 | 27.203652 | Hypertension | 117/75 |
| 12 | P0013 | 46 | 27.203652 | Hypertension | NaN |
| 13 | P0014 | 40 | 31.369513 | Hypertension | 102/69 |
| 14 | P0015 | 27 | 27.203652 | Diabetes | 92/75 |
| 15 | P0016 | 21 | 27.203652 | NaN | NaN |
| 16 | P0017 | 69 | 27.203652 | Hypertension | 131/82 |
| 17 | P0018 | 43 | 26.752736 | Hypertension | NaN |
| 18 | P0019 | 23 | 27.885888 | Hypertension | 110/81 |
| 19 | P0020 | 38 | 29.278378 | Hypertension | NaN |
| 20 | P0021 | 63 | 23.506374 | Diabetes | 100/89 |
| 21 | P0022 | 56 | 27.203652 | Diabetes | NaN |
| 22 | P0023 | 43 | 29.850859 | NaN | 112/82 |
| 23 | P0024 | 38 | 27.203652 | NaN | 92/60 |
| 24 | P0025 | 35 | 27.203652 | Hypertension | 90/69 |
| 25 | P0026 | 51 | 27.203652 | Diabetes | 113/70 |
| 26 | P0027 | 65 | 23.628654 | Diabetes | 107/74 |
| 27 | P0028 | 46 | 22.254053 | NaN | NaN |
| 28 | P0029 | 39 | 28.550565 | Hypertension | NaN |
| 29 | P0030 | 60 | 27.203652 | Diabetes | 137/75 |
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[11], line 1
----> 1 mean_bp = df["Blood Pressure"].mean()

File ~\AppData\Roaming\Python\Python312\site-packages\pandas\core\series.py:6529, in Series.mean(self, axis, skipna, numeric_only, **kwargs)
   6521 @doc(make_doc("mean", ndim=1))
   6522 def mean(
   6523 self,
   (...)
   6527 **kwargs,
   6528 ):
-> 6529 return NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)

File ~\AppData\Roaming\Python\Python312\site-packages\pandas\core\generic.py:12413, in NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)
  12406 def mean(
  12407 self,
  12408 axis: Axis | None = 0,
   (...)
  12411 **kwargs,
  12412 ) -> Series | float:
> 12413 return self._stat_function(
  12414 "mean", nanops.nanmean, axis, skipna, numeric_only, **kwargs
  12415 )

File ~\AppData\Roaming\Python\Python312\site-packages\pandas\core\generic.py:12370, in NDFrame._stat_function(self, name, func, axis, skipna, numeric_only, **kwargs)
  12366 nv.validate_func(name, (), kwargs)
  12368 validate_bool_kwarg(skipna, "skipna", none_allowed=False)
> 12370 return self._reduce(
  12371 func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
  12372 )

File ~\AppData\Roaming\Python\Python312\site-packages\pandas\core\series.py:6437, in Series._reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
   6432 # GH#47500 - change to TypeError to match other methods
   6433 raise TypeError(
   6434 f"Series.{name} does not allow {kwd_name}={numeric_only} "
   6435 "with non-numeric dtypes."
   6436 )
-> 6437 return op(delegate, skipna=skipna, **kwds)

File ~\AppData\Roaming\Python\Python312\site-packages\pandas\core\nanops.py:147, in bottleneck_switch.__call__.<locals>.f(values, axis, skipna, **kwds)
    145 result = alt(values, axis=axis, skipna=skipna, **kwds)
    146 else:
--> 147 result = alt(values, axis=axis, skipna=skipna, **kwds)
    149 return result

File ~\AppData\Roaming\Python\Python312\site-packages\pandas\core\nanops.py:404, in _datetimelike_compat.<locals>.new_func(values, axis, skipna, mask, **kwargs)
    401 if datetimelike and mask is None:
    402 mask = isna(values)
--> 404 result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
    406 if datetimelike:
    407 result = _wrap_results(result, orig_values.dtype, fill_value=iNaT)

File ~\AppData\Roaming\Python\Python312\site-packages\pandas\core\nanops.py:719, in nanmean(values, axis, skipna, mask)
    716 dtype_count = dtype
    718 count = _get_counts(values.shape, mask, axis, dtype=dtype_count)
--> 719 the_sum = values.sum(axis, dtype=dtype_sum)
    720 the_sum = _ensure_numeric(the_sum)
    722 if axis is not None and getattr(the_sum, "ndim", False):

File ~\AppData\Roaming\Python\Python312\site-packages\numpy\core\_methods.py:49, in _sum(a, axis, dtype, out, keepdims, initial, where)
     47 def _sum(a, axis=None, dtype=None, out=None, keepdims=False,
     48 initial=_NoValue, where=True):
---> 49 return umr_sum(a, axis, dtype, out, keepdims, initial, where)

TypeError: can only concatenate str (not "int") to str
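The TypeError above arises because `Blood Pressure` holds strings such as `119/66`, so `mean()` cannot sum them. One possible approach, sketched below, is to split each reading into two numeric series before averaging. The names `systolic` and `diastolic` are assumptions introduced for illustration and do not appear in the notebook:

```python
import numpy as np
import pandas as pd

# A few readings in the "systolic/diastolic" string format, plus a missing one.
bp = pd.Series(["119/66", "103/62", "98/70", "117/87", np.nan], name="Blood Pressure")

# Split on "/" into two columns; a missing reading stays missing in both.
parts = bp.str.split("/", expand=True)

# Convert each part to numbers; anything unparsable becomes NaN.
systolic = pd.to_numeric(parts[0], errors="coerce")
diastolic = pd.to_numeric(parts[1], errors="coerce")

# mean() now works because both series are numeric (NaN is skipped).
mean_systolic = systolic.mean()
mean_diastolic = diastolic.mean()
print(mean_systolic, mean_diastolic)
```

Keeping the two components separate preserves the information in each reading, and either mean can then be used to fill the missing entries.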
dtype('O')
| | Patient ID | Age | BMI | Diagnosis | Blood Pressure |
|---|---|---|---|---|---|
| 0 | P0001 | 64 | 23.416480 | NaN | 193 |
| 1 | P0002 | 49 | 30.539825 | NaN | 163 |
| 2 | P0003 | 68 | 31.654859 | Hypertension | 201 |
| 3 | P0004 | 22 | 27.203652 | Hypertension | 109 |
| 4 | P0005 | 42 | 27.203652 | Hypertension | 149 |
| | Patient ID | Age | BMI | Diagnosis | Blood Pressure |
|---|---|---|---|---|---|
| 25 | P0026 | 51 | 27.203652 | Diabetes | 167 |
| 26 | P0027 | 65 | 23.628654 | Diabetes | 195 |
| 27 | P0028 | 46 | 22.254053 | NaN | 157 |
| 28 | P0029 | 39 | 28.550565 | Hypertension | 143 |
| 29 | P0030 | 60 | 27.203652 | Diabetes | 185 |
dtype('O')
0 23.416480
1 30.539825
2 31.654859
3 27.203652
4 27.203652
5 27.203652
6 20.154990
7 21.675982
8 31.285261
9 33.454012
10 27.203652
11 27.203652
12 27.203652
13 31.369513
14 27.203652
15 27.203652
16 27.203652
17 26.752736
18 27.885888
19 29.278378
20 23.506374
21 27.203652
22 29.850859
23 27.203652
24 27.203652
25 27.203652
26 23.628654
27 22.254053
28 28.550565
29 27.203652
Name: BMI, dtype: float64
0 0.245243
1 0.780872
2 0.864715
3 0.530014
4 0.530014
5 0.530014
6 0.000000
7 0.114369
8 0.836924
9 1.000000
10 0.530014
11 0.530014
12 0.530014
13 0.843259
14 0.530014
15 0.530014
16 0.530014
17 0.496108
18 0.581313
19 0.686019
20 0.252002
21 0.530014
22 0.729066
23 0.530014
24 0.530014
25 0.530014
26 0.261197
27 0.157836
28 0.631293
29 0.530014
Name: BMI, dtype: float64
1.0 0.0
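The scaled BMI values above run from 0 to 1, which is consistent with min-max normalization, (x - min) / (max - min). A short sketch using five of the BMI values shown above, including the smallest and largest:

```python
import pandas as pd

# Five BMI values from the output above, including the minimum and maximum.
bmi = pd.Series([23.416480, 30.539825, 31.654859, 20.154990, 33.454012], name="BMI")

# Min-max scaling maps the smallest value to 0 and the largest to 1.
bmi_scaled = (bmi - bmi.min()) / (bmi.max() - bmi.min())
print(bmi_scaled.round(6))
```

Because the subset keeps the column's minimum (20.154990) and maximum (33.454012), the scaled values agree with the corresponding rows of the notebook output.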
array([21., 37., 53., 69.])
0 Old Home Age
1 Mature
2 Old Home Age
3 Adults
4 Mature
5 Adults
6 Adults
7 Mature
8 Mature
9 Adults
10 Old Home Age
11 Adults
12 Mature
13 Mature
14 Adults
15 Adults
16 Old Home Age
17 Mature
18 Adults
19 Mature
20 Old Home Age
21 Old Home Age
22 Mature
23 Mature
24 Adults
25 Mature
26 Old Home Age
27 Mature
28 Mature
29 Old Home Age
Name: Age-binned, dtype: category
Categories (3, object): ['Adults' < 'Mature' < 'Old Home Age']
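The bin edges `[21., 37., 53., 69.]` and the three ordered labels above match what equal-width binning with `numpy.linspace` and `pandas.cut` would produce. A sketch under that assumption, using six ages from the patient table:

```python
import numpy as np
import pandas as pd

# Six ages taken from the patient table above.
age = pd.Series([64, 49, 68, 22, 42, 21], name="Age")

# Four equally spaced edges between the overall minimum (21) and maximum (69)
# of the Age column give three equal-width bins.
bins = np.linspace(21, 69, 4)
labels = ["Adults", "Mature", "Old Home Age"]

# include_lowest=True keeps the minimum age (21) inside the first bin.
age_binned = pd.cut(age, bins, labels=labels, include_lowest=True)
print(age_binned.tolist())
```

Without `include_lowest=True`, the youngest patient would fall outside the first interval and receive NaN.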
| | Patient ID | Age | BMI | Diagnosis | Blood Pressure | Age-binned |
|---|---|---|---|---|---|---|
| 0 | P0001 | 64 | 0.245243 | NaN | 193 | Old Home Age |
| 1 | P0002 | 49 | 0.780872 | NaN | 163 | Mature |
| 2 | P0003 | 68 | 0.864715 | Hypertension | 201 | Old Home Age |
| 3 | P0004 | 22 | 0.530014 | Hypertension | 109 | Adults |
| 4 | P0005 | 42 | 0.530014 | Hypertension | 149 | Mature |
| 5 | P0006 | 29 | 0.530014 | Hypertension | 123 | Adults |
| 6 | P0007 | 21 | 0.000000 | Diabetes | 107 | Adults |
| 7 | P0008 | 51 | 0.114369 | NaN | 167 | Mature |
| 8 | P0009 | 42 | 0.836924 | NaN | 149 | Mature |
| 9 | P0010 | 35 | 1.000000 | NaN | 135 | Adults |
| | Diabetes | Hypertension |
|---|---|---|
| 0 | False | False |
| 1 | False | False |
| 2 | False | True |
| 3 | False | True |
| 4 | False | True |
| 5 | False | True |
| 6 | True | False |
| 7 | False | False |
| 8 | False | False |
| 9 | False | False |
| 10 | True | False |
| 11 | False | True |
| 12 | False | True |
| 13 | False | True |
| 14 | True | False |
| 15 | False | False |
| 16 | False | True |
| 17 | False | True |
| 18 | False | True |
| 19 | False | True |
| 20 | True | False |
| 21 | True | False |
| 22 | False | False |
| 23 | False | False |
| 24 | False | True |
| 25 | True | False |
| 26 | True | False |
| 27 | False | False |
| 28 | False | True |
| 29 | True | False |
K-Means Clustering
Problem Statement
The following features are available for California houses in a specific locality, obtained from 1990 census data:
- Longitude
- Latitude
- Housing Median Age
- Total Rooms
- Total Bedrooms
- Population
- Households
- Median Income
- Median House Value
- Ocean Proximity
Create clusters/groups of houses based on a selected set of features.
Acknowledgement / Source
Importing Libraries
Loading the Dataset
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
| | longitude | latitude | median_house_value |
|---|---|---|---|
| 0 | -122.23 | 37.88 | 452600.0 |
| 1 | -122.22 | 37.86 | 358500.0 |
| 2 | -122.24 | 37.85 | 352100.0 |
| 3 | -122.25 | 37.85 | 341300.0 |
| 4 | -122.25 | 37.85 | 342200.0 |
(20640, 3)
Visualize the Data
<Axes: xlabel='longitude', ylabel='latitude'>
Pre-Processing
Model
KMeans(n_clusters=3, n_init='auto', random_state=0)
<Axes: xlabel='longitude', ylabel='latitude'>
<Axes: ylabel='median_house_value'>
Silhouette (si·loo·et) Score
- Scores closer to 1: Indicate well-separated clusters, suggesting the clustering is likely effective in capturing the underlying structure in the data.
- Scores around 0: Indicate clusters with some overlap, and you might consider adjusting the number of clusters or the clustering algorithm to see if you can achieve better separation.
- Negative scores: Suggest that some data points are potentially assigned to the wrong cluster, and you might need to explore alternative clustering strategies.
0.7761558886704949
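The score above can be computed with `sklearn.metrics.silhouette_score` after fitting K-Means. The sketch below uses synthetic two-dimensional blobs rather than the housing data, and `n_init=10` instead of `'auto'` so that it also runs on older scikit-learn versions; both choices are assumptions made for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit K-Means with three clusters, as in the model shown above.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Silhouette score lies in [-1, 1]; values near 1 mean well-separated clusters.
score = silhouette_score(X, kmeans.labels_)
print(score)
```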
Choosing the Number of Clusters
2 3 4 5 6 7
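One common way to choose the number of clusters is to fit K-Means for each candidate value (2 through 7, as printed above) and compare silhouette scores. This is a sketch on synthetic data with four blobs; the data, `n_init=10`, and the variable names are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: four well-separated blobs at the corners of a square.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit one model per candidate k and record its silhouette score.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest silhouette score is a reasonable first choice.
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The silhouette score is only one criterion. Plotting the clusters for each candidate, as the notebook does, is a useful check on what the number suggests.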
<Axes: xlabel='longitude', ylabel='latitude'>
<Axes: xlabel='longitude', ylabel='latitude'>
<Axes: xlabel='longitude', ylabel='latitude'>
<Axes: >
<Axes: xlabel='longitude', ylabel='latitude'>
<Axes: ylabel='median_house_value'>
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
Assignment
- Identify which of the following columns have missing values:
- housing_median_age
- total_rooms
- total_bedrooms
- population
- Handle the missing values
- Normalize the data
- Cluster the data into 4 clusters
- Using PCA, reduce the dimensions from 4 to 2
- Visualize the original clusters using the 2 dimensions obtained via PCA
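The assignment steps above can be sketched end to end as follows. This is one possible outline, not a reference solution. It runs on randomly generated stand-in data with the four column names from the list, and the choices of median imputation, `MinMaxScaler`, and `n_init=10` are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Stand-in data with the four assignment columns; values are random.
rng = np.random.default_rng(0)
features = ["housing_median_age", "total_rooms", "total_bedrooms", "population"]
data = pd.DataFrame(rng.uniform(1, 100, size=(200, 4)), columns=features)
data.loc[[3, 17, 42], "total_bedrooms"] = np.nan  # inject a few gaps

# 1. Identify columns with missing values.
missing_counts = data.isna().sum()

# 2. Handle missing values (here: fill with each column's median).
data = data.fillna(data.median())

# 3. Normalize every feature to the [0, 1] range.
scaled = MinMaxScaler().fit_transform(data)

# 4. Cluster into 4 groups using all four scaled features.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scaled)

# 5. Project the 4 features onto 2 principal components for plotting.
components = PCA(n_components=2).fit_transform(scaled)
print(missing_counts, components[:3], labels[:10])
```

A scatter plot of `components[:, 0]` against `components[:, 1]`, coloured by `labels`, would then show the four-dimensional clusters in two dimensions.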
| | housing_median_age | total_rooms | total_bedrooms | population |
|---|---|---|---|---|
| count | 20640.000000 | 20640.000000 | 20433.000000 | 20640.000000 |
| mean | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 |
| std | 12.585558 | 2181.615252 | 421.385070 | 1132.462122 |
| min | 1.000000 | 2.000000 | 1.000000 | 3.000000 |
| 25% | 18.000000 | 1447.750000 | 296.000000 | 787.000000 |
| 50% | 29.000000 | 2127.000000 | 435.000000 | 1166.000000 |
| 75% | 37.000000 | 3148.000000 | 647.000000 | 1725.000000 |
| max | 52.000000 | 39320.000000 | 6445.000000 | 35682.000000 |
C:\Users\SINDH\AppData\Local\Temp\ipykernel_8148\4080736814.py:1: DeprecationWarning: Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0), (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries) but was not found to be installed on your system. If this would cause problems for you, please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466 import pandas as pd
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
5 rows × 26 columns
| | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
5 rows × 26 columns
| | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
| 5 | 2 | ? | audi | gas | std | two | sedan | fwd | front | 99.8 | ... | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 15250 |
| 6 | 1 | 158 | audi | gas | std | four | sedan | fwd | front | 105.8 | ... | 136 | mpfi | 3.19 | 3.40 | 8.5 | 110 | 5500 | 19 | 25 | 17710 |
7 rows × 26 columns
symboling int64
normalized-losses object
make object
fuel-type object
aspiration object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore object
stroke object
compression-ratio float64
horsepower object
peak-rpm object
city-mpg int64
highway-mpg int64
Price object
dtype: object
0 ?
1 ?
2 ?
3 164
4 164
...
200 95
201 95
202 95
203 95
204 95
Name: normalized-losses, Length: 205, dtype: object
0 NaN
1 NaN
2 NaN
3 164.0
4 164.0
...
200 95.0
201 95.0
202 95.0
203 95.0
204 95.0
Name: normalized-losses, Length: 205, dtype: float64
122.0
0 122.0
1 122.0
2 122.0
3 164.0
4 164.0
...
200 95.0
201 95.0
202 95.0
203 95.0
204 95.0
Name: normalized-losses, Length: 205, dtype: float64
dtype('float64')
| | normalized-losses | make |
|---|---|---|
| 0 | 122.0 | alfa-romero |
| 1 | 122.0 | alfa-romero |
| 2 | 122.0 | alfa-romero |
| 3 | 164.0 | audi |
| 4 | 164.0 | audi |
| ... | ... | ... |
| 200 | 95.0 | volvo |
| 201 | 95.0 | volvo |
| 202 | 95.0 | volvo |
| 203 | 95.0 | volvo |
| 204 | 95.0 | volvo |
205 rows × 2 columns
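The outputs above show `normalized-losses` moving from an object column containing `?` to a float column with the gaps filled. One compact way to get the same kind of result is `pandas.to_numeric` with `errors="coerce"`, sketched here on five illustrative values:

```python
import pandas as pd

# Illustrative values in the same format as the raw column.
losses = pd.Series(["?", "?", "164", "164", "95"], name="normalized-losses")

# errors="coerce" turns anything that is not a number (here "?") into NaN
# and yields a float64 series.
losses_num = pd.to_numeric(losses, errors="coerce")

# Fill the gaps with the mean of the observed values.
losses_filled = losses_num.fillna(losses_num.mean())
print(losses_filled)
```

This combines the "replace `?` with NaN" and "convert to float" steps into a single call.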
| | normalized-losses | make |
|---|---|---|
| 0 | 122.0 | alfa-romero |
| 1 | 122.0 | alfa-romero |
| 2 | 122.0 | alfa-romero |
| 3 | 164.0 | audi |
| 4 | 164.0 | audi |
| ... | ... | ... |
| 200 | 95.0 | volvo |
| 201 | 95.0 | volvo |
| 202 | 95.0 | volvo |
| 203 | 95.0 | volvo |
| 204 | 95.0 | volvo |
205 rows × 2 columns
make
alfa-romero 122.000000
audi 144.285714
bmw 156.000000
chevrolet 100.000000
dodge 133.444444
honda 103.000000
isuzu 122.000000
jaguar 129.666667
mazda 123.705882
mercedes-benz 110.000000
mercury 122.000000
mitsubishi 140.615385
nissan 135.166667
peugot 146.818182
plymouth 128.000000
porsche 134.800000
renault 122.000000
saab 127.000000
subaru 92.250000
toyota 110.656250
volkswagen 121.500000
volvo 91.454545
Name: normalized-losses, dtype: float64
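A per-make average like the one above is what a `groupby` followed by `mean` produces. The sketch below assumes that approach and uses a hypothetical toy frame; `mean_by_make` is an illustrative name.

```python
import pandas as pd

# Hypothetical toy data: a few cars per make
toy = pd.DataFrame({
    "make": ["audi", "audi", "volvo", "volvo", "volvo"],
    "normalized-losses": [164.0, 158.0, 95.0, 95.0, 74.0],
})

# Group rows by make, then average the numeric column within each group
mean_by_make = toy.groupby("make")["normalized-losses"].mean()

print(mean_by_make)
```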
| | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 122.0 | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | 122.0 | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | 122.0 | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 2 | 164.0 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 2 | 164.0 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
5 rows × 26 columns
dtype('float64')
0 88.6
1 88.6
2 94.5
3 99.8
4 99.4
...
200 109.1
201 109.1
202 109.1
203 109.1
204 109.1
Name: wheel-base, Length: 205, dtype: float64
| | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 200 | -1 | 95.0 | volvo | gas | std | four | sedan | rwd | front | 109.1 | ... | 141 | mpfi | 3.78 | 3.15 | 9.5 | 114 | 5400 | 23 | 28 | 16845 |
| 201 | -1 | 95.0 | volvo | gas | turbo | four | sedan | rwd | front | 109.1 | ... | 141 | mpfi | 3.78 | 3.15 | 8.7 | 160 | 5300 | 19 | 25 | 19045 |
| 202 | -1 | 95.0 | volvo | gas | std | four | sedan | rwd | front | 109.1 | ... | 173 | mpfi | 3.58 | 2.87 | 8.8 | 134 | 5500 | 18 | 23 | 21485 |
| 203 | -1 | 95.0 | volvo | diesel | turbo | four | sedan | rwd | front | 109.1 | ... | 145 | idi | 3.01 | 3.40 | 23.0 | 106 | 4800 | 26 | 27 | 22470 |
| 204 | -1 | 95.0 | volvo | gas | turbo | four | sedan | rwd | front | 109.1 | ... | 141 | mpfi | 3.78 | 3.15 | 9.5 | 114 | 5400 | 19 | 25 | 22625 |
5 rows × 26 columns
0 21
1 21
2 19
3 24
4 18
..
200 23
201 19
202 18
203 26
204 19
Name: city-mpg, Length: 205, dtype: int64
0 11.190476
1 11.190476
2 12.368421
3 9.791667
4 13.055556
...
200 10.217391
201 12.368421
202 13.055556
203 9.038462
204 12.368421
Name: city-mpg, Length: 205, dtype: float64
0 11.190476
1 11.190476
2 12.368421
3 9.791667
4 13.055556
...
200 10.217391
201 12.368421
202 13.055556
203 9.038462
204 12.368421
Name: c-L/100Km, Length: 205, dtype: float64
| | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | c-L/100Km | highway-mpg | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 122.0 | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 11.190476 | 27 | 13495 |
| 1 | 3 | 122.0 | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 11.190476 | 27 | 16500 |
| 2 | 1 | 122.0 | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 12.368421 | 26 | 16500 |
| 3 | 2 | 164.0 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 9.791667 | 30 | 13950 |
| 4 | 2 | 164.0 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 13.055556 | 22 | 17450 |
5 rows × 26 columns
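The converted values above match dividing 235 by the original `city-mpg` figure (for example, 235 / 21 ≈ 11.190476), followed by renaming the column to `c-L/100Km`. The constant 235 is inferred from those numbers rather than stated in the notebook, so treat the sketch below as an assumption; `toy` is a hypothetical frame.

```python
import pandas as pd

# Hypothetical toy data using the first few city-mpg values shown above
toy = pd.DataFrame({"city-mpg": [21, 19, 24, 18]})

# Convert miles per gallon to litres per 100 km (approximate factor 235)
toy["city-mpg"] = 235 / toy["city-mpg"]

# Rename the column so the name reflects the new unit
toy = toy.rename(columns={"city-mpg": "c-L/100Km"})

print(toy)
```

The same two steps applied to `highway-mpg` would give the `h-L/100Km` column that appears later in the output.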
dtype('O')
C:\Users\SINDH\AppData\Local\Temp\ipykernel_3076\2782822628.py:1: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.

  df["Price"].replace("?", np.nan, inplace = True)
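The FutureWarning above comes from calling `replace(..., inplace=True)` on the selected column `df["Price"]`. The warning text itself suggests assigning the result back instead. A minimal sketch of that pattern, on a hypothetical toy frame named `toy`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same "?" placeholder as the Price column
toy = pd.DataFrame({"Price": ["13495", "?", "16500"]})

# Assign the result back instead of using inplace=True on a column selection
toy["Price"] = toy["Price"].replace("?", np.nan)

# Cast the cleaned column to float so numeric operations work
toy["Price"] = toy["Price"].astype("float64")

print(toy["Price"].dtype)
```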
0 13495.0
1 16500.0
2 16500.0
3 13950.0
4 17450.0
...
200 16845.0
201 19045.0
202 21485.0
203 22470.0
204 22625.0
Name: Price, Length: 205, dtype: float64
dtype('float64')
0 8.703704
1 8.703704
2 9.038462
3 7.833333
4 10.681818
...
200 8.392857
201 9.400000
202 10.217391
203 8.703704
204 9.400000
Name: h-L/100Km, Length: 205, dtype: float64
| | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | c-L/100Km | h-L/100Km | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 122.0 | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 11.190476 | 8.703704 | 13495 |
| 1 | 3 | 122.0 | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 11.190476 | 8.703704 | 16500 |
| 2 | 1 | 122.0 | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 12.368421 | 9.038462 | 16500 |
| 3 | 2 | 164.0 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 9.791667 | 7.833333 | 13950 |
| 4 | 2 | 164.0 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 13.055556 | 10.681818 | 17450 |
5 rows × 26 columns
0 168.8
1 168.8
2 171.2
3 176.6
4 176.6
...
200 188.8
201 188.8
202 188.8
203 188.8
204 188.8
Name: length, Length: 205, dtype: float64
0 0.811148
1 0.811148
2 0.822681
3 0.848630
4 0.848630
...
200 0.907256
201 0.907256
202 0.907256
203 0.907256
204 0.907256
Name: length, Length: 205, dtype: float64
1.0
0 64.1
1 64.1
2 65.5
3 66.2
4 66.4
...
200 68.9
201 68.8
202 68.9
203 68.9
204 68.9
Name: width, Length: 205, dtype: float64
60.3 72.3
0 0.316667
1 0.316667
2 0.433333
3 0.491667
4 0.508333
...
200 0.716667
201 0.708333
202 0.716667
203 0.716667
204 0.716667
Name: width, Length: 205, dtype: float64
1.0
0.0
0 -2.015483
1 -2.015483
2 -0.542200
3 0.235366
4 0.235366
...
200 0.726460
201 0.726460
202 0.726460
203 0.726460
204 0.726460
Name: height, Length: 205, dtype: float64
-2.4247287815509493 2.486215399755926
-2.4247287815509493
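The three normalized columns above behave like three common scaling methods: `length` divided by its maximum (simple feature scaling, maximum 1.0), `width` rescaled to the 0 to 1 range (min-max), and `height` standardized to a z-score. The exact cells are not shown, so the sketch below is an assumption. The toy values are hypothetical, with the length maximum chosen so the first ratio matches the 0.811148 shown above, and the z-score uses the pandas default sample standard deviation.

```python
import pandas as pd

# Hypothetical toy data; not the full data set
toy = pd.DataFrame({
    "length": [168.8, 171.2, 176.6, 208.1],
    "width": [60.3, 64.1, 66.2, 72.3],
    "height": [48.8, 52.4, 54.3, 59.8],
})

# Simple feature scaling: divide by the maximum, so the largest value is 1
scaled = toy["length"] / toy["length"].max()

# Min-max scaling: map the smallest value to 0 and the largest to 1
width_range = toy["width"].max() - toy["width"].min()
minmax = (toy["width"] - toy["width"].min()) / width_range

# Z-score: subtract the mean and divide by the standard deviation
zscore = (toy["height"] - toy["height"].mean()) / toy["height"].std()

print(scaled.max(), minmax.min(), minmax.max())
print(zscore.min(), zscore.max())
```

Unlike the first two methods, the z-score is not bounded to the 0 to 1 range, which is why the `height` output above spans roughly -2.42 to 2.49.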
dtype('O')
'?'
0 13495
1 16500
2 16500
3 13950
4 17450
...
200 16845
201 19045
202 21485
203 22470
204 22625
Name: Price, Length: 205, dtype: object
0 0.316667
1 0.316667
2 0.433333
3 0.491667
4 0.508333
...
200 0.716667
201 0.708333
202 0.716667
203 0.716667
204 0.716667
Name: width, Length: 205, dtype: float64
---------------------------------------------------------------------------
UFuncTypeError                            Traceback (most recent call last)
Cell In[54], line 1
----> 1 bins = np.linspace(min(df["Price"]), max(df["Price"]), 4)

File ~\AppData\Roaming\Python\Python312\site-packages\numpy\core\function_base.py:129, in linspace(start, stop, num, endpoint, retstep, dtype, axis)
    125 div = (num - 1) if endpoint else num
    127 # Convert float/complex array scalars to float, gh-3504
    128 # and make sure one can use variables that have an __array_interface__, gh-6634
--> 129 start = asanyarray(start) * 1.0
    130 stop = asanyarray(stop) * 1.0
    132 dt = result_type(start, stop, float(num))

UFuncTypeError: ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U5'), dtype('float64')) -> None
array([0. , 0.33333333, 0.66666667, 1. ])
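The UFuncTypeError in the traceback above arises because `Price` is still an object column of strings (including `?`) at that point, so `min` and `max` return strings and `np.linspace` cannot multiply them by a float. One way around this, sketched here on a hypothetical toy frame, is to convert the column to numbers first; `pd.to_numeric` with `errors="coerce"` is a swapped-in alternative to the `replace` call used elsewhere in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Price stored as strings with a "?" placeholder
toy = pd.DataFrame({"Price": ["13495", "?", "16500", "22625"]})

# Coerce to numbers; anything unparseable (such as "?") becomes NaN
price = pd.to_numeric(toy["Price"], errors="coerce")

# Series.min() and Series.max() skip NaN, so the bin edges are numeric
bins = np.linspace(price.min(), price.max(), 4)

print(bins)
```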
| | width | width-binned |
|---|---|---|
| 0 | 0.316667 | Low |
| 1 | 0.316667 | Low |
| 2 | 0.433333 | Medium |
| 3 | 0.491667 | Medium |
| 4 | 0.508333 | Medium |
| ... | ... | ... |
| 200 | 0.716667 | High |
| 201 | 0.708333 | High |
| 202 | 0.716667 | High |
| 203 | 0.716667 | High |
| 204 | 0.716667 | High |
205 rows × 2 columns
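The `width-binned` column above can be produced by cutting the normalized `width` values into three equal-width intervals and labelling them. The sketch below assumes `np.linspace` for the edges (matching the four-element array shown earlier) and `pd.cut` for the assignment; the toy values and the `labels` list are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: min-max normalized widths between 0 and 1
toy = pd.DataFrame({"width": [0.0, 0.316667, 0.433333, 0.508333, 0.716667, 1.0]})

# Four equally spaced edges define three equal-width bins
bins = np.linspace(toy["width"].min(), toy["width"].max(), 4)
labels = ["Low", "Medium", "High"]

# include_lowest=True keeps the minimum value inside the first bin
toy["width-binned"] = pd.cut(toy["width"], bins, labels=labels, include_lowest=True)

print(toy)
```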
| | diesel | gas |
|---|---|---|
| 0 | False | True |
| 1 | False | True |
| 2 | False | True |
| 3 | False | True |
| 4 | False | True |
| ... | ... | ... |
| 200 | False | True |
| 201 | False | True |
| 202 | False | True |
| 203 | True | False |
| 204 | False | True |
205 rows × 2 columns
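The `diesel` / `gas` indicator table above has the shape of a one-hot encoding of `fuel-type`. A minimal sketch, assuming `pd.get_dummies` was used and working on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy data with the two fuel types seen above
toy = pd.DataFrame({"fuel-type": ["gas", "gas", "diesel", "gas"]})

# One indicator column per category, in sorted order: diesel, gas
dummies = pd.get_dummies(toy["fuel-type"])

# Attach the indicator columns to the original frame
toy = pd.concat([toy, dummies], axis=1)

print(toy)
```

Each row has exactly one true indicator, so one of the two columns is redundant and could be dropped before modelling.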